The dataset was obtained from Kaggle. It contains 299 observations and 13 variables. The outcome variable ‘DEATH_EVENT’ indicates whether a patient died of heart failure, based on 11 other predictors. The variable names are shown below:
NB: The 12th variable ‘time’ indicates the time from the start of the study after which observation of the subject ended. This, presumably, could be because the subject was declared healthy, dropped out of the study for various reasons, or died of heart failure. Since that time would not be available in real-world use, when the resulting model is predicting the outcome for a new case, using it as a feature would cause target leakage; the ‘time’ variable is therefore excluded from model training.
## Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
## 'ejection_fraction', 'high_blood_pressure', 'platelets',
## 'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
## 'DEATH_EVENT'],
## dtype='object')
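A minimal sketch of separating features and target while excluding ‘time’. The tiny DataFrame here is a stand-in for the Kaggle CSV (in practice this would be a `pd.read_csv` call on the downloaded file), but the column handling is the same:

```python
import pandas as pd

# Toy stand-in for the Kaggle data; only a couple of columns for brevity.
df = pd.DataFrame({
    'age': [60.0, 75.0], 'ejection_fraction': [38, 20],
    'time': [130, 4], 'DEATH_EVENT': [0, 1],
})

# Exclude 'time' (target leakage) and the label itself from the features.
X = df.drop(columns=['time', 'DEATH_EVENT'])
y = df['DEATH_EVENT']
print(list(X.columns))  # ['age', 'ejection_fraction']
```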
We can do a quick overview of the only two demographic variables in the dataset: age and sex. From the output below, we see that the respondents’ ages range from 40 to 95 years, with a median age of 60 years and a mean age of approximately 61 years.
## count 299.000000
## mean 60.833893
## std 11.894809
## min 40.000000
## 25% 51.000000
## 50% 60.000000
## 75% 70.000000
## max 95.000000
## Name: age, dtype: float64
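The summary above comes from a describe-style call on the age column; a minimal equivalent on a made-up handful of ages:

```python
import pandas as pd

ages = pd.Series([40, 51, 60, 70, 95], name='age')  # made-up sample values
print(ages.describe())   # count / mean / std / min / quartiles / max
print(ages.median())     # 60.0
```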
We are now closer to our goal of comparing the performance of various ML models on the dataset. The features here are pre-selected based on domain knowledge. First, let us check our outcome variable. In our dataset, the proportion of “No” examples is much higher than the proportion of “Yes” examples. The main challenge with imbalanced data is whether the ML model can predict the minority class as accurately as the majority class. Thus, there is a danger of our ML algorithms being biased if trained on this data, as they would have far more “No” examples to learn from.
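Checking the class balance is a one-liner with `value_counts`. The counts below are illustrative, chosen only to show an imbalance of the kind described:

```python
import pandas as pd

# Illustrative label column with a majority of "No" (0) outcomes.
death_event = pd.Series([0] * 200 + [1] * 99, name='DEATH_EVENT')
print(death_event.value_counts(normalize=True).round(2))
```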
We will address this imbalance with some feature engineering using the Synthetic Minority Oversampling Technique (SMOTE). SMOTE uses a k-nearest-neighbour algorithm to synthesize new minority-class examples, which helps avoid the overfitting that can occur with random oversampling (which simply duplicates existing examples). I chose SMOTE over random undersampling of the majority class because I want to preserve the data and not eliminate any examples, since I do not have much training data to begin with!
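In practice one would call SMOTE from the imbalanced-learn (imblearn) package; the NumPy sketch below illustrates only the core idea, interpolating between a minority-class example and one of its k nearest minority neighbours (the 2-D points and k here are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# A handful of minority-class points (made-up 2-D data).
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def smote_sample(X, k=2, rng=rng):
    """Generate one synthetic point: pick a minority example, pick one of
    its k nearest minority neighbours, and interpolate between them."""
    i = rng.integers(len(X))
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                        # interpolation factor in [0, 1)
    return X[i] + gap * (X[j] - X[i])

synthetic = np.array([smote_sample(minority) for _ in range(3)])
print(synthetic.shape)  # (3, 2)
```

Because each synthetic point is a convex combination of two real minority points, it always lies between existing examples rather than being a verbatim copy.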
First, we identify features with low variance, since they would not help the model much in finding patterns, and de-select them. We will also check for multicollinearity amongst the features and de-select one feature per strongly correlated pair.
VarianceThreshold(threshold=0.15)
## array([ True, True, True, True, True, True, True, True])
From the results, per our threshold criterion, all the features have sufficiently high variance (less than 85% similarity amongst their values), so none are dropped.
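The variance check can be sketched with scikit-learn’s `VarianceThreshold` on a made-up matrix, where the second column is deliberately near-constant so it fails the threshold:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up matrix: the second column is nearly constant (low variance).
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 0.0],
              [4.0, 0.5]])

selector = VarianceThreshold(threshold=0.15)
selector.fit(X)
print(selector.get_support())  # [ True False ] -- only col 0 passes 0.15
```

`get_support()` returns the boolean mask of retained features; in the analysis above the mask was all `True`, so every feature survived.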
## age anaemia ... high_blood_pressure smoking
## age 1.000000 0.006208 ... 0.073552 0.008866
## anaemia 0.006208 1.000000 ... 0.023697 -0.084894
## creatinine_phosphokinase -0.125774 -0.194478 ... -0.097409 -0.008196
## diabetes -0.115747 -0.022903 ... 0.007914 -0.061488
## ejection_fraction 0.056246 0.040359 ... 0.055527 0.010200
## serum_sodium -0.060176 0.070946 ... 0.050072 0.036309
## high_blood_pressure 0.073552 0.023697 ... 1.000000 -0.046882
## smoking 0.008866 -0.084894 ... -0.046882 1.000000
##
## [8 rows x 8 columns]
There is no strong collinearity amongst the variables, so all features are retained.
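The pairwise check above can be sketched as follows, flagging any feature pair whose absolute correlation exceeds a cut-off (the 0.8 cut-off and the toy data are assumptions; column ‘d’ is deliberately collinear with ‘a’ to show what a flagged pair looks like):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy frame standing in for the candidate features.
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
df['d'] = 2 * df['a']  # perfectly collinear with 'a'

corr = df.corr().abs()
# Keep only the strict upper triangle so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
collinear_pairs = [(r, c) for r in upper.index for c in upper.columns
                   if upper.loc[r, c] > 0.8]
print(collinear_pairs)  # [('a', 'd')]
```

For each flagged pair, one of the two features would be de-selected before modelling.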
Now, we’re going to use SequentialFeatureSelector (SFS) from the mlxtend library, a Python library of data science tools. SFS is a greedy procedure: at each iteration, we choose the best new feature to add to our selected features based on a cross-validation score. For forward selection, we start with zero features and choose the single feature with the highest score. The procedure is repeated until we reach the desired number of selected features. We will use the “best” option, where the selector returns the feature subset with the best cross-validation performance.
SequentialFeatureSelector(estimator=LogisticRegression(max_iter=1000),
                          k_features=(1, 8), scoring='accuracy')
## ('age', 'creatinine_phosphokinase', 'ejection_fraction', 'serum_sodium')
Let’s do a little experiment and see which features are selected when we use the raw data before SMOTE was applied:
SequentialFeatureSelector(estimator=LogisticRegression(max_iter=1000),
                          k_features=(1, 8), scoring='accuracy')
## ('age', 'diabetes', 'ejection_fraction', 'high_blood_pressure')
The results are different, which underscores the importance of data pre-processing and feature engineering before rushing ahead with machine learning.
We will go with the features selected from the SMOTE-transformed data.
The following builds a dictionary with each model as the key and its metrics (accuracy, precision, recall) as the values:
## {'LogisticRegression(max_iter=1000)': [73.33333333333333, 15.789473684210526, 100.0], 'SVC()': [83.33333333333334, 23.076923076923077, 100.0], 'KNeighborsClassifier()': [65.0, 5.0, 33.33333333333333], 'DecisionTreeClassifier()': [76.66666666666667, 13.333333333333334, 66.66666666666666], 'RandomForestClassifier()': [68.33333333333333, 13.636363636363635, 100.0], 'GradientBoostingClassifier()': [90.0, 28.57142857142857, 66.66666666666666]}
Converting the dictionary into a DataFrame for easier visual comparison of the models and their metrics:
## index Accuracy Precision Recall
## 0 LogisticRegression(max_iter=1000) 73.333333 15.789474 100.000000
## 1 SVC() 83.333333 23.076923 100.000000
## 2 KNeighborsClassifier() 65.000000 5.000000 33.333333
## 3 DecisionTreeClassifier() 76.666667 13.333333 66.666667
## 4 RandomForestClassifier() 68.333333 13.636364 100.000000
## 5 GradientBoostingClassifier() 90.000000 28.571429 66.666667
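The dictionary-building step could look roughly like this (the split, the two models shown, and the data are stand-ins for illustration, not the exact code used above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

metrics = {}
for model in [LogisticRegression(max_iter=1000), SVC()]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # str(model) gives keys like 'LogisticRegression(max_iter=1000)'.
    metrics[str(model)] = [accuracy_score(y_test, pred) * 100,
                           precision_score(y_test, pred) * 100,
                           recall_score(y_test, pred) * 100]

# Transpose so models are rows and metrics are named columns.
results = (pd.DataFrame(metrics, index=['Accuracy', 'Precision', 'Recall'])
           .T.reset_index())
print(results)
```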
From the above, how do we choose the best model for this problem?
Let’s use the metrics of the logistic regression as an example:

Accuracy: an accuracy of 73.33% means the model classified 73.33% of all test examples correctly.

Precision: a precision of 15.79% means that, of all the patients the model flagged as likely to die of heart failure, only 15.79% actually did.

Recall: a recall of 100% means the model correctly identified every patient in the test set who died of heart failure.
Note: to make the model metrics identical every time the script is run, fix the random_state parameter wherever randomness enters the pipeline, most importantly in SMOTE and in the train/test split.
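A fixed `random_state` makes each stochastic step deterministic. The sketch below demonstrates this with `train_test_split` on made-up data; imblearn’s `SMOTE` accepts the same parameter (shown commented out so the sketch runs without imbalanced-learn installed):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)      # made-up feature matrix
y = np.array([0] * 7 + [1] * 3)       # made-up imbalanced labels

# A fixed random_state makes every run produce the identical split...
X_a, _, _, _ = train_test_split(X, y, random_state=42)
X_b, _, _, _ = train_test_split(X, y, random_state=42)
print(np.array_equal(X_a, X_b))  # True

# ...and imblearn's SMOTE takes the same parameter:
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=42, k_neighbors=2).fit_resample(X, y)
```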